The holographic bound in physics constrains the complexity of life. The finite information storage capacity of the observable universe necessitates protein linguistics in the evolution of life. We find that the evolution of the genetic code determines the variation of amino acid frequencies and genomic GC content among species. This linguistic mechanism is supported by experimental
observations based on all known complete proteomes.

The phenomenon of life is governed by the general principles of physics, so progress in understanding the physical world may provide new insight into the origin and evolution of life. The analogy between biology and linguistics at the level of sequences hints that biological information is processed by underlying linguistic rules. Several attempts have been made to combine linguistic theory with biology [1]. But the existence of linguistics in biomacromolecular sequences needs a physical explanation. The holographic bound, intimately related to the holographic principle, came from the deep insights of Bekenstein and Hawking in the 1970s [2, 3]. Its validity is ensured by the second law of thermodynamics. Interestingly, the existence of linguistics in biomacromolecular sequences can be explained by the holographic bound. In the past decades, biology has changed greatly. Wada advocated "... to determine the 'first principles' of bio-sciences and link them with the first principles of non-bio-sciences in order to understand the complex systems." and Gilbert also emphasized the importance of theoretical methods in biology [4]. Nowadays, the intimacy between biology and physics is unprecedented. Considering the significant role of information in both physics and biology [5, 6], the gap between physics and biology may be bridged from the viewpoint of information.

In the post-genomic era, the number of complete proteomes is increasing rapidly. We can take all known complete proteomes as samples to study the global properties of life on our planet. The GC content [7] is a global property that varies greatly among species. We also found another global property, the evolution of amino acid frequencies, although these frequencies vary only slightly. The mechanism behind the variation of genomic GC content and amino acid frequencies has been a long-standing and far-reaching problem [8]. The genetic code evolved in the context of a four-letter alphabet as the 20 amino acids joined protein sequences chronologically [9]. The nature of the prime biased AT/GC pressure and the reason for the correlation of GC content between total genomic DNA and the 1st, 2nd and 3rd codon positions were unknown. The mechanism behind the variation of amino acid frequencies has not been studied; worse, the amino acid frequencies are routinely assumed to be constant. All of these basic problems in biology are solved in our theoretical framework based on formal linguistics and the evolution of the genetic code.

In this paper, we first explain the existence of protein linguistics and the limited complexity of life in the universe in terms of the holographic bound. Second, a linguistic model is proposed to reveal the mechanism of the evolution of amino acid frequencies and genomic GC content as well as the protein length distribution. The excellent fit between our simulations and the experimental observations strongly supports the linguistic mechanism. The experimental observations are based on the data of 106 complete proteomes (85 eubacteria, 12 archaebacteria, 7 eukaryotes and 2 viruses) in the database PEP [10] and the GC content data in the Genome Properties system database [11]. "Information" is the thread of the paper, connecting problems in physics and biology that are traditionally regarded as unrelated.

According to the holographic bound, which states that the information storage capacity of a spatially finite system is limited by a quarter of its boundary area measured in Planck areas unless the second law of thermodynamics is untrue, the entropy $S$ in a volume of radius $R$ satisfies

$$ S \le S_{\max} = \pi \left( \frac{R}{l_p} \right)^2 , \qquad (1) $$

where $l_p$ is the Planck length. There is much astronomical evidence that our universe may be headed for an infinite de Sitter space. The holographic bound can be applied to the observable universe with its finite event horizon. Therefore we can estimate the upper limit of the information storage capacity of the observable universe as $I_{\rm univ} \sim 10^{122}$ bits [3]. From the point of view of physics, the entropy in our universe is given primarily by the number of blackbody cosmic background photons, $\sim 10^{90}$, which is definitely less than $I_{\rm univ}$.
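
As a rough numerical check of Eq. (1), the following sketch evaluates the bound for a horizon radius of order 10^26 m. The radius `R_HORIZON_M` and the function name are assumptions introduced here for illustration and are not values taken from the paper; the entropy in units of k_B is divided by ln 2 to express it in bits.

```python
import math

PLANCK_LENGTH_M = 1.616e-35   # Planck length in metres
R_HORIZON_M = 1.6e26          # ASSUMED event-horizon radius (illustrative value only)

def holographic_bound_bits(radius_m):
    """Upper bound pi * (R / l_p)**2 on the entropy, converted from nats to bits."""
    s_max_nats = math.pi * (radius_m / PLANCK_LENGTH_M) ** 2
    return s_max_nats / math.log(2)

bound = holographic_bound_bits(R_HORIZON_M)
print(f"log10(I_univ / bits) = {math.log10(bound):.1f}")
```

With a radius of this order the exponent comes out close to the quoted 122; the exact prefactor depends on the radius one assumes.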

However, the upper limit of information $I_{\rm univ}$ is not so large when we consider the information of a living system. First, let us give a reasonable definition of a "living" system from the viewpoint of information. Many restless functional proteins, composed of 20 amino acids (a.a.), distinguish life from lifeless matter. Thus, a general living system $L(n)$ is defined as the set of all possible proteins with length no more than $n$ a.a., each of which is in either the folded state or the unfolded state. The maximum length $n$ indicates the complexity of the system. Next, let us calculate the information of the system. There are about $20^n$ sequences of length up to $n$, each in one of two states, so the number of states of $L(n)$ is $\Omega(n) \approx 2^{20^n}$; hence its information is

$$ I(n) = \log_2 \Omega(n) \approx 20^n \ \text{bits}. $$

The exquisite single-chain structure of proteins can provide much more information storage capacity than lifeless matter. The upper limit of information $I_{\rm univ}$ forbids $L(n)$ with $n > n_0$ from existing in our universe, where $n_0 = 94$ a.a. is such that $I(n_0) \sim I_{\rm univ}$. Interestingly, the most frequent protein length for life on our planet is about $n_0$.
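
The critical length can be checked by solving 20^n ~ 10^122 in log space. The sketch below assumes that the information of L(n) is about 20^n bits and that the critical length is the integer nearest to the solution; the function name `critical_length` is introduced here for illustration.

```python
import math

def critical_length(log10_info_bits, alphabet_size=20):
    """Integer n for which alphabet_size**n is closest (in log10) to the given bound."""
    return round(log10_info_bits / math.log10(alphabet_size))

n0 = critical_length(122)          # bound of ~10^122 bits quoted in the text
print(n0, n0 * math.log10(20))     # length and log10 of 20**n0
```

Because the dependence on n is exponential, changing the bound by an order of magnitude shifts the critical length by less than one residue.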

The actual system of life on the planet is not one of the $L(n)$, because the average protein lengths for different species, ranging from 250 a.a. to 500 a.a., are greater than $n_0$. Let the actual living system $L_{\rm earth}$ be the set of all possible proteins on the planet, and let $n^*$ ($> n_0$) be the maximum protein length. $L_{\rm earth}$ must be a proper subset of $L(n^*)$ because the information of $L_{\rm earth}$ is bounded by $I_{\rm univ}$. According to linguistic theory, $L_{\rm earth}$ is a language over the alphabet of 20 amino acids. So we have demonstrated the existence of protein linguistics in terms of the holographic bound, which can be derived from the second law of thermodynamics. In early evolution, when genetic information was delivered from the RNA world to the DNA-protein world, the holographic bound required the grammars to allow a very small part of the protein sequences and to forbid all the others. There is an easy way to implement the prohibition: the observable universe cannot accommodate all the proteins in $L(n^*)$. In fact, our conclusion is based on a general principle, free from the subtleties of the hierarchy.

If $L(n)$ has an additional property, e.g., chirality, $L_r(n)$ ($L_l(n)$) becomes the set of proteins composed of right-handed (left-handed) amino acids. Let $C(n)$ be the set of all possible chiral $L(n)$. Already at $n = 2$ a.a., the information of $C(2)$, i.e., $I_c(2) = \Omega(2) \approx 10^{120}$ bits, is near $I_{\rm univ}$. Chirality might have brought too much redundant information; broken symmetry was the solution. So the entropy bound is a strong law that constrains the forms of possible life in general. Let us imagine the most complex creature of the same height as us to be one whose genetic information is stored at the Planck scale. The information stored in its body can be estimated as $1/l_p^3 \sim 10^{105}$ bits (with $l_p$ in metres), which violates the holographic bound. So such a complex creature cannot exist.
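
The two order-of-magnitude estimates in this paragraph can be reproduced in log space. This is a sketch under stated assumptions: the number of states is taken as 2^(20^n), which reproduces the quoted 10^120 for n = 2; the creature is modelled as a cube of side 1 m storing one bit per Planck-sized cell; and the function names are introduced here for illustration.

```python
import math

PLANCK_LENGTH_M = 1.616e-35  # Planck length in metres

def log10_states(n):
    """log10 of 2**(20**n), the ASSUMED form of Omega(n)."""
    return (20 ** n) * math.log10(2)

def log10_bound_bits(radius_m):
    """log10 of the holographic bound pi * (R / l_p)**2, expressed in bits."""
    return math.log10(math.pi / math.log(2)) + 2 * math.log10(radius_m / PLANCK_LENGTH_M)

def log10_planck_cells(side_m):
    """log10 of the number of Planck-sized cells in a cube of the given side (one bit per cell assumed)."""
    return 3 * math.log10(side_m / PLANCK_LENGTH_M)

print(f"log10 Omega(2)            = {log10_states(2):.1f}")
print(f"log10 Planck-scale bits   = {log10_planck_cells(1.0):.1f}")
print(f"log10 bound for R = 1 m   = {log10_bound_bits(1.0):.1f}")
```

The volume estimate scales as the cube of the size while the bound scales as the square, which is why storage at the Planck scale throughout a macroscopic body exceeds the bound.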

So far, we have seen that linguistics must have played a significant role in generating proteins in primordial times. In the following, a linguistic model is proposed to simulate the variation of the amino acid frequencies and the genomic GC content as well as the protein length distributions.

FIG. 1: Evolution of amino acid frequencies. The 20 amino acids are aligned chronologically. The variation for each amino acid in the simulation fits the experimental observation. (a) Experimental observation based on the data of 106 species. For each amino acid, the 106 species are aligned from left to right by $R_{10/10}$. (b) Simulation by the linguistic model. The 30 simulated proteomes are aligned by $t$, which increases from 0.02 to 0.40 in equal steps. (The simulations in Fig. 1, Fig. 2 and Fig. 4 are obtained together from the linguistic model.)



The model consists of three parts: (i) generate protein sequences by a tree adjoining grammar [12]; (ii) set the amino acids for the leaves of the grammars in (i) according to the tree of genetic code multiplicity (which can be obtained from symmetry analysis, see [13]), with consideration of the amino acid chronology [14]; and (iii) translate the protein sequences to DNA sequences according to the genetic code chronology [15]. The evolution of the genetic code is the core of the model. There is a variable $t$ in the model, which represents time in evolution. A proteome for a species is defined as a collection of many proteins generated by the model at fixed $t$, so $t$ also identifies species in the model. Thus, the amino acid frequencies and the average protein length for a species can be calculated. The evolutionary trends of the amino acid frequencies can be determined when proteomes are generated at different times $t$. We can also simulate the evolution of genomic GC content after translating the protein sequences to DNA sequences.
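
To make step (iii) concrete, here is a toy illustration of how a time-dependent codon choice can shift GC content. It is not the authors' model: the codon chronology table of [15] is not reproduced here, so the sketch uses the standard genetic code and the assumed rule that an amino acid is encoded by a codon ending in A/T with probability `t` and by one ending in G/C otherwise. The names `expected_gc` and `mean_gc` are hypothetical, and DNA letters (T rather than U) are used.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(triplet))

def mean_gc(codons):
    """Average fraction of G/C bases over a list of codons."""
    return sum(b in "GC" for c in codons for b in c) / (3 * len(codons))

def expected_gc(protein, t):
    """Expected GC content of the coding DNA under the ASSUMED toy rule."""
    total = 0.0
    for aa in protein:
        early = [c for c in CODONS[aa] if c[2] in "GC"]  # G/C in third position
        late = [c for c in CODONS[aa] if c[2] in "AT"]   # A/T in third position
        if late:
            total += (1 - t) * mean_gc(early) + t * mean_gc(late)
        else:  # Met and Trp have a single codon, ending in G
            total += mean_gc(early)
    return total / len(protein)

for t in (0.02, 0.21, 0.40):
    print(t, round(expected_gc("GADVPSELTR", t), 3))
```

In this toy rule only the third codon position responds to `t`, so the overall GC content falls as `t` grows while the first and second positions stay fixed.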

FIG. 2: Evolution of genomic GC content. (a) Relationship between genomic GC content and $R_{10/10}$ for the species in the database PEP (dots) and its simulation by the linguistic model (solid line, also decreasing). (b) Simulation of the correlation of GC content between total genomic DNA and the 1st, 2nd, and 3rd codon positions, which agrees with the experimental observation in detail (see fig. 5 in [7], fig. 2 in [8] and fig. 9-1 in [17]).

FIG. 3: Rainbow distribution. Relationship between the average protein length and the highest frequency of the discrete Fourier transform of the protein length distribution (the cutoff is 3000) for each of the 106 species. The distribution of the species from the three domains resembles a rainbow. Even for a group of closely related species such as the mycoplasmas (belonging to eubacteria), the distribution also forms an "arch" of the rainbow.

The evolution of amino acid frequencies can be explained by the model. According to the consensus chronology in which amino acids were recruited into the genetic code, from the earliest to the latest G, A, D, V, P, S, E, L, T, R, Q, I, N, H, K, C, F, Y, M, W, we sort the 106 species by the ratio $R_{10/10}$ of the average frequency of the 10 later amino acids to the average frequency of the 10 earlier amino acids. We then obtain the evolutionary trends of the amino acid frequencies: the frequencies of G, A, D, V, P, T, R, H, W decrease, the frequencies of S, E, I, N, K, F, Y increase, and the frequencies of L, Q, C, M do not vary noticeably (Fig. 1a). The variations of the amino acid frequencies are, by and large, remarkably monotonic. Therefore, it is reasonable to assume that a mechanism underlies the evolution of amino acid frequencies.
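
The ratio used to order the species can be computed directly from a proteome. The sketch below assumes a proteome is given as a list of one-letter protein sequences and that residues outside the 20 standard amino acids are ignored; the function names are introduced here for illustration.

```python
from collections import Counter

# Consensus chronology as listed in the text, earliest to latest.
CHRONOLOGY = "GADVPSELTRQINHKCFYMW"
EARLY, LATE = CHRONOLOGY[:10], CHRONOLOGY[10:]

def amino_acid_frequencies(proteins):
    """Relative frequency of each of the 20 amino acids over all sequences."""
    counts = Counter(aa for seq in proteins for aa in seq if aa in CHRONOLOGY)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in CHRONOLOGY}

def r_10_10(proteins):
    """Average frequency of the 10 later amino acids over that of the 10 earlier ones.

    Raises ZeroDivisionError if the input has no standard residues or no early ones.
    """
    freq = amino_acid_frequencies(proteins)
    early = sum(freq[aa] for aa in EARLY) / len(EARLY)
    late = sum(freq[aa] for aa in LATE) / len(LATE)
    return late / early

print(r_10_10(["MGADVKKLLEQ", "MSTRWYFNGA"]))
```

Sorting a collection of proteomes by this value gives the left-to-right ordering of species described for Fig. 1a.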

The simulation of our linguistic model [Fig. 1b] agrees with the data of the 106 species [Fig. 1a] not only in the evolutionary trends but also in the magnitudes of variation for most of the amino acids. Note that no parameter is added on purpose in the model to alter the trend for a particular amino acid. The evolution of amino acid frequencies is sensitive to the amino acid multiplicities [13]; any departure from them would spoil the results. Therefore, it is the evolution of the genetic code that determines the evolution of amino acid frequencies. An important property of the model is that the parameters of amino acid frequencies are constant, which indicates that the variation of amino acid frequencies developed during a short period. This agrees with the view that the genetic code was established quickly. Recently, Jordan et al. observed contemporary amino acid gain and loss, for which different explanations have been offered [16]. We believe that the evolution of the genetic code drives the amino acid gain and loss.

The genomic GC content decreases linearly with $R_{10/10}$ for the species in the database PEP. The simulation of our model agrees qualitatively with this experimental observation [Fig. 2a]. In our model, the evolution of amino acid frequencies and the evolution of genomic GC content are driven by a common variable $t$. A protein sequence generated at a later time $t$ corresponds to a DNA sequence translated using the later codons [15], which results in the relationship between genomic GC content and $R_{10/10}$.

The simulation of the correlation of GC content between total genomic DNA and the first, second, and third codon positions [Fig. 2b] also agrees with the results based on the data of completed genomes [7, 8, 17], where the correlation slope of the third codon position is much greater than those of the first and second positions. There is a characteristic convex bend in the middle of the line of the simulated GC content for the first codon position, which agrees remarkably well with the experimental observations [7, 8]. In the table of codon chronology [15], G and C (A and U) occupy all the third positions of the earliest (latest) codons for the 20 amino acids, while the bases appear about equally in the first and second positions. Therefore, the correlation slopes for the first and second positions are small while the slope for the third position is large. The lower limit $l \sim 0.3$ and the upper limit $u \sim 0.7$ of the GC content among species can be explained similarly; $l + u = 1$ is required in theory.

The linguistic mechanism is also supported by the distribution of protein length. When observing the distribution of species in the space of average protein length and the highest frequency of the discrete Fourier transform of the protein length distribution, we unexpectedly found that the species of the three domains gather in three parallel arches respectively, resembling a rainbow [Fig. 3]. This indicates that the fluctuations in protein length distributions cannot be products of a stochastic process. The characteristics of the protein length distribution (bell-shaped profile, periodic-like fluctuations) have been simulated by the linguistic model, and should be intrinsic properties related to the underlying grammars [18].

We also find a relationship between the average protein length and the ratio of amino acid frequencies. The species of the three domains gather in different regions in the space of the average protein length and the ratio $R_{HQW/GV}$ of the average frequency of several later amino acids (H, Q, W) to the average frequency of several earlier ones (G, V) [Fig. 4]. The points of all species form a bending line [Fig. 4], which can be explained as an evolutionary flow in that (i) the species with large (small) genomes are located in the midstream (margin) of the flow [Fig. 4] and (ii) the (rightward) evolutionary direction parallels the directions of decreasing correlations of protein length distributions among groups of closely related species. The evolutionary flow can be simulated by our model [Fig. 4, Embedded]. The evolutionary direction and the bending direction in the simulation agree with the evolutionary flow of the 106 species.

FIG. 4: The evolutionary flow. Relationship between average protein length and $R_{HQW/GV}$ for the 106 species. The species of the three domains (Archaebacteria: blue square, Eubacteria: dot, Eukaryotes: red circle) gather together in respective regions and all the species form an evolutionary flow. The proteome size is represented proportionally by the tail under each species (big: red, middle: green and small: blue); species with big genome sizes are located in the midstream of the evolutionary flow. (Embedded) Simulation of the evolutionary flow, whose (upward) bending direction agrees with the direction of the experimental observation.

In conclusion, the holographic bound, which limits the maximum complexity of life, improves our understanding of life. Linguistics is necessary for the storage of information in protein/DNA sequences. We show that the particular variations of amino acid frequencies and GC content among contemporary species are the products of a certain genetic code multiplicity and amino acid chronology that evolved in primordial times. The linguistic model succeeds not only in simulating the respective aspects (amino acid frequencies [Fig. 1], GC content [Fig. 2b], protein length distribution [Fig. 3]) but also in their relationships (amino acid frequencies and GC content [Fig. 2a], amino acid frequencies and protein length distribution [Fig. 4]). So the thorough and detailed fit between simulations and experimental observations confirms the validity of the linguistic framework, which is grounded in general principles of physics.

We thank Hefeng Wang, Liu Zhao, Donald R. Forsdyke, Zhenwei Yao for valuable discussions. Supported by NSF of China Grant No. 10374075.